AITopics | lexical normalization

Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages

Higashiyama, Shohei, Utiyama, Masao

arXiv.org Artificial Intelligence, Dec-2-2025

Lexical normalization research has sought to tackle the challenge of processing informal expressions in user-generated text, yet the absence of comprehensive evaluations leaves it unclear which methods excel across multiple perspectives. Focusing on unsegmented languages, we make three key contributions: (1) creating a large-scale, multi-domain Japanese normalization dataset, (2) developing normalization methods based on state-of-the-art pretrained models, and (3) conducting experiments across multiple evaluation perspectives. Our experiments show that both encoder-only and decoder-only approaches achieve promising results in both accuracy and efficiency.

computational linguistic, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2505.22273

Country:

Asia > Japan > Honshū (0.46)
North America > United States > Minnesota (0.28)
North America > United States > New Mexico (0.28)
Europe > Middle East > Malta (0.28)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.92)
Information Technology > Communications > Social Media (0.68)

ViSoLex: An Open-Source Repository for Vietnamese Social Media Lexical Normalization

Nguyen, Anh Thi-Hoang, Nguyen, Dung Ha, Van Nguyen, Kiet

arXiv.org Artificial Intelligence, Jan-12-2025

ViSoLex is an open-source system designed to address the unique challenges of lexical normalization for Vietnamese social media text. The platform provides two core services: Non-Standard Word (NSW) Lookup and Lexical Normalization, enabling users to retrieve standard forms of informal language and standardize text containing NSWs. ViSoLex's architecture integrates pre-trained language models and weakly supervised learning techniques to ensure accurate and efficient normalization, overcoming the scarcity of labeled data in Vietnamese. This paper details the system's design, functionality, and its applications for researchers and non-technical users. Additionally, ViSoLex offers a flexible, customizable framework that can be adapted to various datasets and research requirements. By publishing the source code, ViSoLex aims to contribute to the development of more robust Vietnamese natural language processing tools and encourage further research in lexical normalization. Future directions include expanding the system's capabilities for additional languages and improving the handling of more complex non-standard linguistic patterns.
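The two services named in the abstract, NSW Lookup and Lexical Normalization, can be pictured with a minimal dictionary-based sketch. This is not the ViSoLex implementation, which the abstract says relies on pre-trained language models and weak supervision. The lexicon entries and the function names `lookup_nsw` and `normalize` are hypothetical, chosen only for illustration.

```python
# Hypothetical lexicon: illustrative entries only, not taken from ViSoLex.
NSW_LEXICON = {
    "ko": "không",
    "dc": "được",
}


def lookup_nsw(token):
    """Return the standard form of a non-standard word, or None if unknown."""
    return NSW_LEXICON.get(token.lower())


def normalize(text):
    """Replace each whitespace-separated token found in the lexicon.

    Tokens that are not in the lexicon are left unchanged.
    """
    return " ".join(lookup_nsw(tok) or tok for tok in text.split())


if __name__ == "__main__":
    print(lookup_nsw("ko"))
    print(normalize("toi ko di dc"))
```

A pure lookup like this cannot resolve tokens whose standard form depends on context, which is one reason a model-based normalizer is useful alongside it.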

computational linguistic, lexical normalization, normalization, (12 more...)

arXiv.org Artificial Intelligence

2501.0702

Country:

Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.05)
North America > United States > New Mexico (0.04)
Europe > Switzerland (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

A Weakly Supervised Data Labeling Framework for Machine Lexical Normalization in Vietnamese Social Media

Nguyen, Dung Ha, Nguyen, Anh Thi Hoang, Van Nguyen, Kiet

arXiv.org Artificial Intelligence, Sep-30-2024

This study introduces an automatic labeling framework to address the challenges of lexical normalization in social media texts for low-resource languages like Vietnamese. Social media data is rich and diverse, but the evolving and varied language used in these contexts makes manual labeling labor-intensive and expensive. To tackle these issues, we propose a framework that integrates semi-supervised learning with weak supervision techniques. This approach enhances the quality of the training dataset and expands its size while minimizing manual labeling effort. Our framework automatically labels raw data, converting non-standard vocabulary into standardized forms, thereby improving the accuracy and consistency of the training data. Experimental results demonstrate the effectiveness of our weak supervision framework in normalizing Vietnamese text, especially when utilizing Pre-trained Language Models. The proposed framework achieves an F1-score of 82.72% and maintains vocabulary integrity with an accuracy of up to 99.22%. Additionally, it effectively handles undiacritized text under various conditions. The framework improves normalization quality and raises the accuracy of various downstream NLP tasks by 1-3% on average.

lexical normalization, normalization, weakly supervised data, (12 more...)

arXiv.org Artificial Intelligence

2409.20467

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Beijing > Beijing (0.04)
Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Services (1.00)
Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

ViLexNorm: A Lexical Normalization Corpus for Vietnamese Social Media Text

Nguyen, Thanh-Nhi, Le, Thanh-Phong, Van Nguyen, Kiet

arXiv.org Artificial Intelligence, Jan-31-2024

Lexical normalization, a fundamental task in Natural Language Processing (NLP), involves the transformation of words into their canonical forms. This process has been proven to benefit various downstream NLP tasks greatly. In this work, we introduce Vietnamese Lexical Normalization (ViLexNorm), the first-ever corpus developed for the Vietnamese lexical normalization task. The corpus comprises over 10,000 pairs of sentences meticulously annotated by human annotators, sourced from public comments on Vietnam's most popular social media platforms. Various methods were used to evaluate our corpus, and the best-performing system achieved a result of 57.74% using the Error Reduction Rate (ERR) metric (van der Goot, 2019a) with the Leave-As-Is (LAI) baseline. For extrinsic evaluation, employing the model trained on ViLexNorm demonstrates the positive impact of the Vietnamese lexical normalization task on other NLP tasks. Our corpus is publicly available exclusively for research purposes.
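To make the reported metric concrete, here is a small sketch of Error Reduction Rate against the Leave-As-Is baseline. It assumes the usual definition, ERR = (system accuracy - baseline accuracy) / (1 - baseline accuracy), with word-level accuracy and a one-to-one alignment between source, gold, and system tokens. The function names and the toy sentence are hypothetical and are not drawn from ViLexNorm.

```python
def word_accuracy(gold, pred):
    """Fraction of positions where the predicted token equals the gold token."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def error_reduction_rate(source, gold, system):
    """ERR of a system relative to the Leave-As-Is (LAI) baseline.

    The LAI baseline outputs the source tokens unchanged, so its accuracy
    is the share of tokens that needed no normalization.
    """
    baseline_acc = word_accuracy(gold, source)
    system_acc = word_accuracy(gold, system)
    if baseline_acc == 1.0:
        raise ValueError("ERR is undefined when no token needs normalization")
    return (system_acc - baseline_acc) / (1.0 - baseline_acc)


if __name__ == "__main__":
    source = ["toi", "ko", "di", "dc"]          # raw input (toy example)
    gold = ["toi", "không", "di", "được"]       # reference normalization
    system = ["toi", "không", "di", "dc"]       # fixes one of two errors
    print(error_reduction_rate(source, gold, system))  # 0.5
```

Under this definition, 0 means the system does no better than leaving the text untouched, 1 means every token matches the reference, and negative values mean the system introduced more errors than it fixed.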

bartpho syllable, computational linguistic, normalization, (14 more...)

arXiv.org Artificial Intelligence

2401.16403

Country:

Asia > Vietnam > Bình Định Province (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Services (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Automatic Textual Normalization for Hate Speech Detection

Nguyen, Anh Thi-Hoang, Nguyen, Dung Ha, Nguyen, Nguyet Thi, Ho, Khanh Thanh-Duy, Van Nguyen, Kiet

arXiv.org Artificial Intelligence, Dec-4-2023

Social media data is a valuable resource for research, yet it contains a wide range of non-standard words (NSW). These irregularities hinder the effective operation of NLP tools. Current state-of-the-art methods for the Vietnamese language address this issue as a problem of lexical normalization, involving the creation of manual rules or the implementation of multi-staged deep learning frameworks, which necessitate extensive efforts to craft intricate rules. In contrast, our approach is straightforward, employing solely a sequence-to-sequence (Seq2Seq) model. In this research, we provide a dataset for textual normalization, comprising 2,181 human-annotated comments with an inter-annotator agreement of 0.9014. By leveraging the Seq2Seq model for textual normalization, our results reveal that the accuracy achieved falls slightly short of 70%. Nevertheless, textual normalization enhances the accuracy of the Hate Speech Detection (HSD) task by approximately 2%, demonstrating its potential to improve the performance of complex NLP tasks. Our dataset is accessible for research purposes.

automatic textual normalization, normalization, textual normalization, (11 more...)

arXiv.org Artificial Intelligence

2311.06851

Country:

Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
Asia > Singapore > Central Region > Singapore (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.94)

A Fast Randomized Algorithm for Massive Text Normalization

Jiang, Nan, Luo, Chen, Lakshman, Vihan, Dattatreya, Yesh, Xue, Yexiang

arXiv.org Artificial Intelligence, Oct-6-2021

Many popular machine learning techniques in natural language processing and data mining rely heavily on high-quality text sources. However, real-world text datasets contain a significant number of spelling errors and improperly punctuated variants, on which the performance of these models quickly deteriorates. Moreover, real-world, web-scale datasets contain hundreds of millions or even billions of lines of text, over which existing text cleaning tools are prohibitively expensive to run and may require additional overhead to learn the corrections. In this paper, we present FLAN, a scalable randomized algorithm to clean and canonicalize massive text data. Our algorithm relies on the Jaccard similarity between words to suggest correction results. We efficiently handle the pairwise word-to-word comparisons via Locality Sensitive Hashing (LSH). We also propose a novel stabilization process to address the issue of hash collisions between dissimilar words, which is a consequence of the randomized nature of LSH and is exacerbated by the massive scale of real-world datasets. Compared with existing approaches, our method is more efficient, both asymptotically and in empirical evaluations, and does not rely on additional features, such as lexical/phonetic similarity or word embedding features. In addition, FLAN does not require any annotated data or supervised learning. We further theoretically show the robustness of our algorithm with upper bounds on the false positive and false negative rates of corrections. Our experimental results on real-world datasets demonstrate the efficiency and efficacy of FLAN.
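The combination of Jaccard similarity and LSH described in the abstract can be sketched with MinHash signatures and banding. This is a generic illustration, not FLAN itself: it omits the stabilization process, and the choices of character bigrams, 32 hash functions, 16 bands, and salted BLAKE2b as the hash family are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def shingles(word, n=2):
    """Set of character n-grams of the word, padded with boundary markers."""
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity between the n-gram sets of two words."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def _hash(gram, seed):
    """Deterministic 64-bit hash of an n-gram; the seed selects the function."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(4, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(word, num_hashes=32):
    """MinHash signature: the minimum hash value under each seeded function."""
    grams = shingles(word)
    return tuple(min(_hash(g, seed) for g in grams) for seed in range(num_hashes))


def candidate_pairs(words, num_hashes=32, bands=16):
    """Word pairs that share at least one LSH band, found without all-pairs work."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for word in set(words):
        sig = minhash_signature(word, num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(word)
    pairs = set()
    for group in buckets.values():
        ordered = sorted(group)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


if __name__ == "__main__":
    vocab = ["normalization", "normalisation", "normalizaton", "banana"]
    for a, b in sorted(candidate_pairs(vocab)):
        print(a, b, round(jaccard(a, b), 3))
```

Because the bucketing is randomized, dissimilar words can occasionally share a band and similar words can occasionally be missed. The sketch leaves such collisions in place; handling them is the role the abstract assigns to the stabilization step.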

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2110.03024

Country:

North America > United States (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology > Services (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

A Clustering Framework for Lexical Normalization of Roman Urdu

Khan, Abdul Rafae, Karim, Asim, Sajjad, Hassan, Kamiran, Faisal, Xu, Jia

arXiv.org Artificial Intelligence, Mar-31-2020

Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm UrduPhone, a string matching component, a feature-based similarity function, and a clustering algorithm Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script.
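The idea of encoding spelling variants to a shared pronunciation-based key and then grouping them can be illustrated with a deliberately crude sketch. The encoding below is a toy stand-in and is not UrduPhone, and the grouping is a plain exact-key bucket rather than the Lex-Var clustering algorithm. The rules (collapse repeated letters, drop non-initial vowels and "y") and the example spellings are assumptions made for illustration.

```python
import re
from collections import defaultdict

VOWEL_LIKE = set("aeiouy")


def toy_phonetic_key(word):
    """Toy pronunciation key: NOT UrduPhone, only an illustrative stand-in.

    Lowercases, keeps ASCII letters, collapses repeated letters, then drops
    vowel-like letters everywhere except the first position.
    """
    letters = re.sub(r"[^a-z]", "", word.lower())
    collapsed = re.sub(r"(.)\1+", r"\1", letters)
    if not collapsed:
        return ""
    head, tail = collapsed[0], collapsed[1:]
    return head + "".join(ch for ch in tail if ch not in VOWEL_LIKE)


def cluster_by_key(words):
    """Group words whose toy keys are identical, preserving input order."""
    clusters = defaultdict(list)
    for word in words:
        clusters[toy_phonetic_key(word)].append(word)
    return dict(clusters)


if __name__ == "__main__":
    variants = ["nahi", "nahee", "kya", "kia", "acha", "accha", "achha"]
    for key, members in cluster_by_key(variants).items():
        print(key, members)
```

Exact-key bucketing merges only variants that encode identically. The framework in the abstract goes further by combining the phonetic encoding with string matching and a feature-based similarity function before clustering.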

dataset, normalization, variation, (12 more...)

arXiv.org Artificial Intelligence

doi: 10.1017/S1351324920000285

2004.00088

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
Africa > Middle East > Egypt > Giza Governorate > Giza (0.05)

Genre: Research Report (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
